Draft

Spam Detection

NLP
code
NLP
Natural Language Processing
Python, R
Author

Simone Brazzi

Published

September 5, 2024

1 Introduction

Tasks:

  • Train a model to classify spam.
  • Find the main topic in the spam emails.
  • Calculate the semantic distance between the spam topics.
  • Extract the ORG entities from the ham emails.

2 Import

2.1 R libraries

Code
library(tidyverse, verbose = FALSE)
library(ggplot2)
library(gt)
library(reticulate)
library(plotly)

# renv::use_python("/Users/simonebrazzi/venv/blog/bin/python3")

2.2 Python packages

Code
import pandas as pd
import numpy as np
import spacy
import nltk
import string
import gensim
import gensim.corpora as corpora
import gensim.downloader

from nltk.corpus import stopwords
from gensim.models import Word2Vec
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# glove vector
glove_vector = gensim.downloader.load("glove-wiki-gigaword-300")
# nltk stopwords
nltk.download('stopwords')
True
Code
stop_words = stopwords.words('english')

2.3 Config class

Code
class Config():
  def __init__(self):
    
    """
    CLass initialize function.
    """
    
    self.path="/Users/simonebrazzi/R/blog/posts/spam_detection/spam_dataset.csv"
    self.random_state=42
  
  def get_lemmas(self, doc):
    
    """
    List comprehension to get lemmas. It performs:\n
      1. Lowercase.\n
      2. Stop words removal\n.
      3. Whitespaces removal.\n
      4. Remove the word 'subject'.\n
      5. Digits removal.\n
      6. Get only words with length >= 5.\n
    """
    
    lemmas = [
      [
        t.lemma_ for t in d 
        if (text := t.text.lower()) not in punctuation 
        and text not in stop_words 
        and not t.is_space 
        and text != "subject" 
        and not t.is_digit
        and len(text) >=5
        ]
        for d in doc
        ]
    return lemmas
  
  def get_entities(self, doc):
    
    """
    List comprehension to get the lemmas which have entity type 'ORG'.
    """
    
    entities = [
    [
      t.lemma_ for t in d 
      if (text := t.text.lower()) not in punctuation 
      and text not in stop_words 
      and not t.is_space 
      and text != "subject" 
      and not t.is_digit
      and len(text) >=5
      and t.ent_type_ == "ORG"
      ]
      for d in doc
      ]
    return entities
  
  def get_sklearn_mlp(self, activation, solver, max_iter, hidden_layer_sizes, tol):
    
    """
    It initialize the sklearn MLPClassifier.
    """
    
    mlp = MLPClassifier(
      activation=activation,
      solver=solver,
      max_iter=max_iter,
      hidden_layer_sizes=hidden_layer_sizes,
      tol=tol,
      verbose=True
      )
    return mlp
  
  def get_lda(self, corpus, id2word, num_topics, passes, workers, random_state):
    
    """
    Initialize LDA.
    """
    
    lda = gensim.models.LdaMulticore(
      corpus=corpus,
      id2word=id2word,
      num_topics=num_topics,
      passes=passes,
      workers=workers,
      random_state=random_state
      )
    return lda
  
config = Config()

3 Dataset

Code
df = pd.read_csv(config.path, index_col=0)

To perform further analysis in R, we assign the df to an R variable using the reticulate library.

Code
library(reticulate)
df <- py$df

3.1 EDA

First things first: Exploratory Data Analysis. Since we are going to perform a classification, it is interesting to check whether our dataset is imbalanced.

Code
df_g <- df %>% 
  summarise(
    freq_abs = n(),
    freq_rel = n() / nrow(df),
    .by = label
    )

df_g %>% 
  gt() %>% 
  fmt_auto() %>% 
  cols_width(
    label ~ pct(20),
    freq_abs ~ pct(35),
    freq_rel ~ pct(35)
    ) %>% 
  cols_align(
    align = "center",
    columns = c(freq_abs, freq_rel)
  ) %>% 
  tab_header(
    title = "Label frequency",
    subtitle = "Absolute and relative frequencies"
  ) %>% 
  cols_label(
    label = "Label",
    freq_abs = "Absolute frequency",
    freq_rel = "Relative frequency"
  ) %>% 
  tab_options(
    table.width = pct(100)
    )
Table 1: Absolute and relative frequencies
Label frequency
Absolute and relative frequencies
Label Absolute frequency Relative frequency
ham 3,672  0.71
spam 1,499  0.29

A visual cue is also useful.

Code
g <- ggplot(df_g) +
  geom_col(aes(x = freq_abs, y = label, fill = label)) +
  scale_fill_brewer(palette = "Set1", direction = -1) +
  theme_minimal() +
  ggtitle("Absolute frequency barchart") +
  xlab("Absolute frequency") +
  ylab("Label") +
  labs(fill = "Label")
ggplotly(g)
Figure 1: Absolute frequency barchart

The dataset is imbalanced, which can be relevant when training the classification model. Depending on the model performance, we know what to investigate first.

4 Preprocessing

With text, preprocessing is fundamental: the computer does not understand the semantic or grammatical meaning of words. For text preprocessing, we follow these steps:

  1. Lowercasing.
  2. Punctuation removal.
  3. Lemmatization.
  4. Tokenization.
  5. Stopwords removal.

Using spaCy, we can apply these steps. The nlp.pipe method improves performance and returns a generator, which yields Doc objects rather than a list; to use it as a list, it must be cast explicitly. To speed up the process, nlp.pipe can also run with multiple processes. But what does the variable nlp stand for? It loads a spaCy model: we are going to use en_core_web_lg.

  • Language: EN; English.
  • Type: CORE; vocabulary, syntax, entities, vectors.
  • Genre: WEB; written text (blogs, news, comments).
  • Size: LG; large (560 MB).

Check this link for the documentation about this model.

I chose this model, even though it is the biggest, to exploit its full potential.

Code
# load en_core_web_lg
nlp = spacy.load("en_core_web_lg")
doc = list(nlp.pipe(df.text, n_process=4))

The specific preprocessing in this case should follow these steps:

  1. Remove punctuation.
  2. Remove stop words.
  3. Remove spaces.
  4. Remove “subject” token.
  5. Lemmatization.

To improve code performance, these are the most noticeable points:

  • When iterating over a collection of unique elements, set() performs better than list(): the underlying hash table allows for swift membership checks. This is particularly noticeable as the df dimension increases.
  • List comprehensions, which perform better than for loops and are more readable in some contexts.
  • The walrus operator :=, a syntax which lets you assign variables in the middle of an expression. It avoids redundant calculations and improves readability (see the sketch below).
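A minimal sketch of the last two points, using made-up tokens: the walrus operator stores t.lower() once and reuses it within the same comprehension, while the set makes the membership check O(1).

Code
# hypothetical tokens, for illustration only
tokens = ["The", "Subject", "Email"]
stop = {"the"}
kept = [t for t in tokens if (low := t.lower()) not in stop and low != "subject"]
print(kept)  # ['Email']
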
Code
punctuation = set(string.punctuation)
stop_words = set(stop_words)

lemmas = config.get_lemmas(doc)
entities = config.get_entities(doc)

# assignment to a df column
df["lemmas"] = lemmas
df["entities"] = entities

5 Tasks

5.1 Classification

The text is already preprocessed as a list of lemmas. For the classification task, it is necessary to convert it back to a string.

Code
df["feature"] = df.lemmas.apply(lambda x : " ".join(x))

5.1.1 Features

As said, the machine does not understand human-readable text: it has to be transformed. A solid approach is to vectorize it with TfidfVectorizer(), a tool for converting text into a matrix of TF-IDF features. Term Frequency-Inverse Document Frequency is a statistical measure of the importance of a word in a document, part of a corpus, adjusted for the word's frequency in the corpus. The vectorizer scores a word by multiplying its Term Frequency

\[ TF = \frac{word\ frequency\ in\ document}{total\ words\ in\ document} \] with the Inverse Document Frequency

\[ IDF = log(\frac{total\ number\ documents}{documents\ containing\ the\ word}) \] The final result is

\[ TF\text{-}IDF = TF * IDF \]

The resulting score represents the importance of a word. It depends on the word frequency both in a specific document and in the corpus.

::: {.callout-note collapse="true"}

An example can be useful. If a word t appears 20 times in a document of 100 words, we have

\[ TF = \frac{20}{100}=0.2 \]

If there are 10,000 documents in the corpus and 100 documents contain the term t

\[ IDF = log(\frac{10000}{100})=2 \]

This means the score is

\[ TF\text{-}IDF = 0.2 * 2 = 0.4 \]

:::
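The same arithmetic as a quick sanity check in Python. Note this is the textbook formula with a base-10 log; sklearn's TfidfVectorizer uses a smoothed natural-log IDF plus normalization, so its scores differ.

Code
import math

tf = 20 / 100                  # term frequency in the document
idf = math.log10(10000 / 100)  # inverse document frequency, base-10 log
print(tf * idf)                # 0.4
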

Code
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['feature'])
y = df.label

5.2 Split

Not much to say about this: a best practice which lets us evaluate the performance of our model on new data.

Code
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=.2, random_state=42)
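Since the dataset is imbalanced, a stratified split would keep the ham/spam ratio identical in both sets. A minimal sketch of this variant, not used for the results below:

Code
# stratify=y preserves the label proportions in train and test sets
xtrain_s, xtest_s, ytrain_s, ytest_s = train_test_split(
  X, y, test_size=.2, random_state=42, stratify=y
  )
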

5.2.1 Model

The model is the MLPClassifier(), a Multi-Layer Perceptron classifier.

MLPClassifier

It is an Artificial Neural Network used for classification. It consists of multiple layers of nodes, called perceptrons. For further reading, see the documentation.

Code
mlp = config.get_sklearn_mlp(
  activation="logistic",
  solver="adam",
  max_iter=100,
  hidden_layer_sizes=(100,),
  tol=.005
  )

5.2.2 Fit

Code
mlp.fit(xtrain, ytrain)
MLPClassifier(activation='logistic', max_iter=100, tol=0.005, verbose=True)

5.2.3 Predict

Code
ypred = mlp.predict(xtest)

5.2.4 Classification report

Considering we are doing a classification, one method to evaluate the performance is the classification report. It summarizes the performance of the model by comparing true and predicted labels, showing not only the metrics (precision, recall and F1-score) but also the support.

Code
cr = classification_report(
  ytest,
  ypred,
  target_names=["spam", "ham"],
  digits=4,
  output_dict=True
  )
df_cr = pd.DataFrame.from_dict(cr).reset_index()
Code
library(reticulate)

df_cr <- py$df_cr %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>% 
  pivot_longer(
    cols = -names,
    names_to = "metrics",
    values_to = "values"
  ) %>% 
  pivot_wider(
    names_from = names,
    values_from = values
  ) %>% 
  gt() %>%
  tab_header(
    title = "Confusion Matrix",
    subtitle = "Sklearn MLPClassifier"
  ) %>% 
  fmt_number(
    columns = c("precision", "recall", "f1-score", "support"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    drop_trailing_dec_mark = FALSE
  ) %>% 
  cols_align(
    align = "center",
    columns = c("precision", "recall", "f1-score", "support")
  ) %>% 
  cols_align(
    align = "left",
    columns = metrics
  ) %>% 
  cols_label(
    metrics = "Metrics",
    precision = "Precision",
    recall = "Recall",
    `f1-score` = "F1-Score",
    support = "Support"
  )
Classification Report
Sklearn MLPClassifier
Metrics Precision Recall F1-Score Support
ham 0.99 0.98 0.99 742.
spam 0.96 0.97 0.97 293.
accuracy 0.98 0.98 0.98 0.98
macro avg 0.97 0.98 0.98 1,035.
weighted avg 0.98 0.98 0.98 1,035.

Even though the model is not tailored to an imbalanced dataset, the imbalance is not affecting the performance. Precision and recall are so high that the model could even seem overfitted.

5.3 Topic Modeling for spam content

Topic modeling in NLP can count on Latent Dirichlet Allocation (LDA). It is a generative model used to discover the topics which occur in a set of documents.

The LDA model has:

  • Input: a corpus of text documents, preprocessed as tokenized and cleaned words. We have this in the lemmas column.
  • Output: a distribution of topics for each document and one of words for each topic.

For further reading, you can find the paper in the Table of Contents or at this link.
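As a sketch of that output: once lda and corpus are built in the next sections, gensim exposes the per-document topic distribution like this.

Code
# (topic_id, probability) pairs for the first document
doc_topics = lda.get_document_topics(corpus[0])
print(doc_topics)  # e.g. [(2, 0.7), (5, 0.2), ...] (illustrative values)
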

5.3.1 Dataset

Filter the data to get a spam dataframe and create a variable with the lemmas column to work with.

Code
spam = df[df.label == "spam"]
x = spam.lemmas

5.3.2 Create corpus

The LDA algorithm needs the corpus as a bag of words.

Code
id2word = corpora.Dictionary(x)
corpus = [id2word.doc2bow(t) for t in x]
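To see what doc2bow produces, here is a minimal sketch with made-up tokens (assuming both occur in the spam lemmas; the actual ids depend on the fitted Dictionary):

Code
# (token_id, count) pairs; "money" appears twice in this toy document
example = id2word.doc2bow(["money", "account", "money"])
print(example)                 # e.g. [(3, 1), (27, 2)] (illustrative ids)
print(id2word[example[0][0]])  # map an id back to its token
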

5.3.3 Model

The model returns a user-defined number of topics. For each of them, it returns a user-defined number of words and the probability of each.

Code
lda = config.get_lda(
  corpus=corpus,
  id2word=id2word,
  num_topics=10,
  passes=10, # number of times the algorithm sees the corpus
  workers=4, # parallelize across cores
  random_state=42
  )
topic_words = lda.show_topics(num_topics=10, num_words=5, formatted=False)

# Iterate over topic_words to extract the data
data = [
    (topic, w, p)
    for topic, words in topic_words
    for w, p in words
    ]
topics_df = pd.DataFrame(data, columns=['topic', 'word', 'proba'])
Code
py$topics_df %>% 
  gt() %>% 
  tab_header(
    title = "Words and probabilities by topics"
  ) %>% 
  fmt_auto() %>% 
  cols_width(
    topic ~ pct(33),
    word ~ pct(33),
    proba ~ pct(33)
    ) %>% 
  cols_align(
    align = "center",
    columns = c(topic, word, proba)
  ) %>% 
  cols_label(
    topic = "Topic",
    word = "Word",
    proba = "Probability"
  ) %>% 
  tab_options(
    table.width = pct(100)
  )
Table 2: Topics id, words and probabilities
Words and probabilities by topics
Topic Word Probability
0 height 0.009
0 family 0.007
0 email 0.006
0 style 0.006
0 number 0.006
1 company 0.009
1 account 0.007
1 email 0.007
1 million 0.006
1 money 0.005
2 price 0.018
2 height 0.011
2 width 0.011
2 align 0.01
2 color 0.008
3 pill 0.029
3 viagra 0.007
3 health 0.005
3 xanax 0.004
3 valium 0.004
4 source 0.003
4 normally 0.003
4 think 0.003
4 palestinian 0.002
4 dosage 0.002
5 company 0.021
5 statement 0.015
5 stock 0.012
5 information 0.009
5 security 0.009
6 computron 0.015
6 remove 0.011
6 contact 0.011
6 message 0.01
6 please 0.009
7 hotlist 0.007
7 moopid 0.007
7 image 0.005
7 width 0.005
7 please 0.004
8 email 0.006
8 weight 0.003
8 without 0.003
8 people 0.003
8 please 0.002
9 cialis 0.009
9 offer 0.005
9 hour 0.005
9 product 0.004
9 money 0.004

5.4 Semantic distance between topics

Create a dict of topic words, then join them into documents to vectorize.

Code
topics_dict = {}
for topics, words in topic_words:
  topics_dict[topics] = [w[0] for w in words]
Code
documents = [" ".join(words) for words in topics_dict.values()]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
cosine_sim_matrix = cosine_similarity(tfidf_matrix, dense_output=True)
topics = list(topics_dict.keys())
cosine_sim_df = pd.DataFrame(cosine_sim_matrix, index=topics, columns=topics)
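As an assumed spot check of a single entry, the scipy cosine distance imported earlier gives the same similarity, since cosine similarity = 1 - cosine distance.

Code
# compare two dense TF-IDF rows directly
v0 = tfidf_matrix[0].toarray().ravel()
v1 = tfidf_matrix[1].toarray().ravel()
print(1 - cosine(v0, v1))  # should match cosine_sim_df.loc[0, 1]
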
Code
py$cosine_sim_df %>% 
  gt() %>% 
  fmt_auto() %>% 
  cols_align(
    align = "center",
    columns = c(`0`, `1`, `2`, `3`, `4`, `5`, `6`, `7`, `8`, `9`)
  ) %>% 
  cols_label(
    `0` = "Topic 0",
    `1` = "Topic 1",
    `2` = "Topic 2",
    `3` = "Topic 3",
    `4` = "Topic 4",
    `5` = "Topic 5",
    `6` = "Topic 6",
    `7` = "Topic 7",
    `8` = "Topic 8",
    `9` = "Topic 9",
  ) %>% 
  tab_options(
    table.width = pct(100)
    )
Table 3: Cosine similarity matrix
Topic 0 Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9
1     0.134 0.166 0     0     0     0     0     0.132 0    
0.134 1     0     0     0     0.166 0     0     0.137 0.166
0.166 0     1     0     0     0     0     0.166 0     0    
0     0     0     1     0     0     0     0     0     0    
0     0     0     0     1     0     0     0     0     0    
0     0.166 0     0     0     1     0     0     0     0    
0     0     0     0     0     0     1     0.125 0.128 0    
0     0     0.166 0     0     0     0.125 1     0.132 0    
0.132 0.137 0     0     0     0     0.128 0.132 1     0    
0     0.166 0     0     0     0     0     0     0     1    

5.5 Organizations (ORG) in “HAM” mails

5.5.1 Create “HAM” df

5.5.2 Get ham lemmas which have ORG entity

Code
ham = df[df.label == "ham"]
x = ham.entities
Code
from collections import Counter

# Flatten the list of lists and create a Counter object
d = Counter([i for e in x for i in e])

freq_df = pd.DataFrame(d.items(), columns=["word", "freq"])
Code
library("wordcloud")
Loading required package: RColorBrewer
Code
library("RColorBrewer")

set.seed(42)
word_freqs_df <- py$freq_df

wordcloud(
  words = word_freqs_df$word,
  freq = word_freqs_df$freq,
  min.freq = 1,
  max.words = 100, # nrow(word_freqs_df),
  random.order = FALSE,
  rot.per = .3,
  colors = brewer.pal(n = 8, name = "Dark2")
  )
Figure 2: Wordcloud
Code
word_freqs_df %>% 
  arrange(desc(freq)) %>% 
  head(10) %>% 
  gt() %>%                                        
  tab_header(
    title = "Top 10 ham words ",
    subtitle = "by frequency"
  ) %>% 
  fmt_auto() %>% 
  cols_width(
    word ~ pct(50),
    freq ~ pct(50)
    ) %>% 
  cols_align(
    align = "center",
    columns = c(word, freq)
  ) %>% 
  cols_label(
    word = "Word",
    freq = "Frequency"
  ) %>% 
  tab_options(
    table.width = pct(100)
  )
Table 4: Top 10 words by frequency
Top 10 ham words
by frequency
Word Frequency
enron 3,268 
daren 1,047 
farmer   472 
north   270 
america   269 
texas   189 
iferc   177 
energy   151 
clynes   144 
sitara   137